A report published on FRED (Federal Reserve Economic Data, maintained by the Federal Reserve Bank of St. Louis) on labour market conditions highlighted a drastic change in software engineer job postings within the past 5 years. Indexed to Feb 1, 2020 = 100, the number of postings rose sharply, peaking in early 2022 (index = 240). Yet, seemingly just as rapidly, the number of postings fell to a low in late 2023. With numerous tech unicorns announcing unprecedented layoff numbers, the tech bubble appears to have burst. This paper will evaluate the software/data engineering market in 2021 and 2023, showing the differences in available roles, postings by location, and employee skillset requirements.
Figure 1: Line chart of Indeed job postings with baseline Feb 1, 2020. The chart is seasonally adjusted based on historical patterns from 2017-2019. Each series, including the national trend, occupational sectors, and sub-national geographies, is seasonally adjusted separately.
The two main questions of interest are as follows:
The goal of this paper is to provide clarity into why so many engineers have been struggling to find employment opportunities in North America. Three datasets are used to answer the above questions: two Kaggle datasets and a dataset found on GitHub.
The first Kaggle dataset, titled Software Engineering Jobs Dataset, was compiled by Yazeed Fares. The dataset contains 9380 observations and 8 features and was collected by scraping LinkedIn Jobs. While the scraping was performed on Dec. 25, 2023, not all jobs were posted on that specific date, as LinkedIn job postings can stay open for up to 6 months. For the purpose of this exploration, we will assume the jobs in this dataset are representative of postings from the latter half of 2023.
The second dataset was developed by Mike Lawrence, a Machine Learning Engineer at Google. The dataset contains 8261 observations and 13 features. Similarly, this dataset was built from scraped LinkedIn postings, collected in October 2021.
Both datasets contain metadata for postings, but not all columns align, so we subset both datasets to their matching features for the purpose of comparison. The variables of interest are listed below. After subsetting the data, all NA values are removed.
| Variables | Type | Description |
|---|---|---|
| Company | character | Name of Company |
| Description | character | Description of job including but not limited to company overview, requirements, skillset |
| Title | character | Name of position |
| Location | character | Location of Job |
| Seniority | character | Classification of role based on experience, technical expertise, leadership responsibilities |
| Year | factor | Year the Job was Posted |
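The subsetting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the paper: the lowercase column names, the `harmonize` helper, and the treatment of empty strings as missing values are all assumptions made for the example.

```python
# Columns shared by both datasets (names are hypothetical, chosen to
# mirror the variables table; the real column names may differ).
SHARED_COLUMNS = ["company", "description", "title", "location", "seniority"]


def harmonize(rows, year):
    """Keep only the shared columns, drop incomplete rows, and tag the year.

    `rows` is a list of dicts, one per posting. A row is dropped when any
    shared column is absent, None, or blank (assumed stand-in for NA).
    """
    cleaned = []
    for row in rows:
        subset = {col: row.get(col) for col in SHARED_COLUMNS}
        has_missing = any(
            value is None or str(value).strip() == ""
            for value in subset.values()
        )
        if has_missing:
            continue
        subset["year"] = year
        cleaned.append(subset)
    return cleaned
```

Calling `harmonize(rows_2021, 2021) + harmonize(rows_2023, 2023)` would then give one combined list in which every posting has the same fields.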
Titles
Due to the structure of job titles, additional data wrangling is required to categorize them properly. For example, variation between postings can result in different titles that correspond to the same type of position (e.g. Sr. Software Engineer vs. Senior Software Engineer). This would affect group_by() functions, resulting in many more categories than necessary. As a result, custom titles are defined based on keyword matches. Using the case_when() function, titles are classified from specific to generic. Thus a title such as “Front-end Software Engineer” gets classified as Front End Engineer rather than Software Engineer.
| Title | Pattern |
|---|---|
| Back End Engineer | Back-end, backend |
| Cloud Engineer | Cloud |
| Data Engineer | Data |
| Data Scientist | Data Scientist |
| DevOps Engineer | Devops |
| Embedded Systems Engineer | Embedded, System |
| Front End Engineer | Front-end, front end, frontend |
| Full Stack Engineer | Full stack, full-stack |
| Machine Learning Engineer | Machine Learning, AI, Artificial Intelligence |
| Mobile Software Engineer | Mobile, iOS, Android |
| Other | .* |
| QA Engineer | Test, Quality, QA |
| Research Engineer | Research, Scientist |
| Security Engineer | Security, Cyber |
| Site Reliability Engineer | Site Reliability, site-reliability |
| Software Engineer | Software |
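The specific-to-generic matching described above can be illustrated with a short Python analogue of the case_when() logic. This is a sketch under stated assumptions: the table is listed alphabetically, so the rule order below is a guess at a sensible priority; the regular expressions, including the word boundaries around short tokens such as AI, iOS, and QA, are illustrative choices rather than the patterns actually used.

```python
import re

# (label, pattern) pairs checked in order; the first match wins.
# The ordering and exact regexes are assumptions for this example.
TITLE_RULES = [
    ("Machine Learning Engineer", r"machine learning|\bai\b|artificial intelligence"),
    ("Data Scientist", r"data scientist"),
    ("Site Reliability Engineer", r"site[- ]reliability"),
    ("Full Stack Engineer", r"full[- ]?stack"),
    ("Front End Engineer", r"front[- ]?end"),
    ("Back End Engineer", r"back[- ]?end"),
    ("DevOps Engineer", r"devops"),
    ("Mobile Software Engineer", r"mobile|\bios\b|android"),
    ("Security Engineer", r"security|cyber"),
    ("QA Engineer", r"test|quality|\bqa\b"),
    ("Cloud Engineer", r"cloud"),
    ("Embedded Systems Engineer", r"embedded|system"),
    ("Research Engineer", r"research|scientist"),
    ("Data Engineer", r"data"),
    ("Software Engineer", r"software"),
]


def classify_title(raw_title):
    """Return the first matching custom title, or "Other" if none match."""
    lowered = raw_title.lower()
    for label, pattern in TITLE_RULES:
        if re.search(pattern, lowered):
            return label
    return "Other"
```

Because the generic "software" rule sits last, `classify_title("Front-end Software Engineer")` falls into Front End Engineer, while "Sr. Software Engineer" and "Senior Software Engineer" both collapse into Software Engineer.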
Seniority
As with titles, seniority levels are inconsistent between the 2 datasets. The 2021 dataset has 8 levels of seniority, while the 2023 dataset contains only 2 classifications. To keep the classifications consistent across both datasets, custom seniority levels are defined based on keywords in the title (e.g. Staff ~ Staff Level).
| Seniority | Pattern |
|---|---|
| Principal | Principal |
| Staff | Staff |
| Lead | Lead |
| Senior | Sr., Sr, Senior, III |
| Founding | Founding |
| Manager | Manager |
| Junior | Entry Level, Junior, Entry-Level, Graduate, Jr., II, Jr, I |
| Junior | Entry level, Associate |
| Senior | Mid-Senior level |
| None Specified | .* |
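The seniority mapping above can be sketched the same way. The Python below is illustrative only: it assumes the rules are applied in table order, that whole-word matching is used so that "I", "II", and "III" are not confused with one another, and that the last rows of the table ("Entry level", "Associate", "Mid-Senior level") act as a fallback on the dataset's own seniority field when the title carries no keyword. None of these details are confirmed by the text.

```python
import re

# Title keyword rules in table order; the first match wins.
# Word boundaries (\b) are an assumption so that "ii" never matches "iii".
SENIORITY_RULES = [
    ("Principal", r"\bprincipal\b"),
    ("Staff", r"\bstaff\b"),
    ("Lead", r"\blead\b"),
    ("Senior", r"\bsr\b|\bsenior\b|\biii\b"),
    ("Founding", r"\bfounding\b"),
    ("Manager", r"\bmanager\b"),
    ("Junior", r"\bentry[- ]level\b|\bjunior\b|\bgraduate\b|\bjr\b|\bii\b|\bi\b"),
]

# Assumed fallback on the posting's listed seniority level.
LISTED_LEVEL_FALLBACK = {
    "entry level": "Junior",
    "associate": "Junior",
    "mid-senior level": "Senior",
}


def classify_seniority(title, listed_level=None):
    """Classify by title keywords first, then by the listed level, if any."""
    lowered = title.lower()
    for label, pattern in SENIORITY_RULES:
        if re.search(pattern, lowered):
            return label
    if listed_level is not None:
        return LISTED_LEVEL_FALLBACK.get(listed_level.strip().lower(), "None Specified")
    return "None Specified"
```

Under these assumptions, "Software Engineer III" maps to Senior and "Software Engineer II" to Junior, matching the table.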
GeoData
Additional wrangling was required to plot Postings Count ~ Location. The provided location data only contains the posting location, formatted as City, State. After some experimentation, the data was found to be best visualized when grouped by state, so a translation to latitude and longitude coordinates was required. After extracting the State from each location string, the Google Geocoding API is used to obtain coordinates for each state. Entries located outside of the US or with incorrect formatting were dropped, leaving 7258 postings from 2021 and 6948 from 2023.
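The state extraction and filtering step can be sketched as follows. This example assumes that locations use two-letter state abbreviations (e.g. "Austin, TX") and that the District of Columbia is kept; both are assumptions, and the geocoding call itself is deliberately left out. The helper names are hypothetical.

```python
from collections import Counter

# Two-letter codes for the 50 states plus DC (keeping DC is an assumption).
US_STATES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS "
    "MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV "
    "WI WY DC".split()
)


def extract_state(location):
    """Return the state code from a "City, State" string, or None.

    Strings that do not split into exactly two parts, or whose second
    part is not a known US state code, are treated as invalid.
    """
    parts = [part.strip() for part in location.split(",")]
    if len(parts) != 2:
        return None
    state = parts[1].upper()
    return state if state in US_STATES else None


def postings_by_state(locations):
    """Count postings per state, dropping non-US or malformed entries."""
    states = (extract_state(location) for location in locations)
    return Counter(state for state in states if state is not None)
```

The resulting per-state counts are what would be joined to the geocoded coordinates to size each bubble on the map.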
The following figures present EDA on our variables of interest. All variables are either factors or text, so visualizations are limited to bar charts listing the (top-n) counts grouped by each feature.
Figure 5 compares the number of postings in 2021 and 2023. Note that these datasets are not exhaustive of all postings in 2021/2023 and shouldn’t be taken as contradictory evidence to the hypothesis. They reflect the number of postings available on LinkedIn at the time of collection in each year. It is very possible that the scraper missed postings, or that volumes were lower or higher at the point in time when the scraper pulled the data.
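As an illustration of the top-n counts behind these bar charts, the following Python sketch tallies a categorical feature separately for each year. It assumes each posting is a dict with a `year` key; the `top_n_by_year` helper is hypothetical and not part of the original analysis.

```python
from collections import Counter, defaultdict


def top_n_by_year(rows, feature, n=10):
    """Return {year: [(value, count), ...]} with the n most common values."""
    per_year = defaultdict(Counter)
    for row in rows:
        per_year[row["year"]][row[feature]] += 1
    return {year: counts.most_common(n) for year, counts in per_year.items()}
```

Each list of (value, count) pairs corresponds to the bars of one year's panel, already sorted from most to least frequent.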
Figure 5: Comparison of number of postings in 2021 and 2023.
Software Engineer was the most popular role in both years, but the key difference is a lack of specificity in roles in 2023. The 2021 dataset has over 1000 occurrences of Machine Learning Engineer (MLE), Site Reliability Engineer and Data Scientist postings, while the second most common title in the 2023 dataset, Embedded Systems Engineer, has 700 postings. Setting aside the possibility that the 2023 scrape did not target Data Science related roles, the difference between the 2 years isn’t as drastic as we hypothesized.
Figure 6: Comparison of top 10 posting titles in 2021 vs 2023.
One noticeable difference between the two years is the desired seniority level. In 2021, entry-level/junior engineer roles were the most frequent seniority, with 2687 postings. However, the 2023 dataset saw a large shift towards senior roles, with the vast majority of postings being for Senior Engineer (4183), along with increased postings for Staff, Principal and Lead Engineer roles.
Figure 7: Comparison of seniority counts for postings in 2021 vs 2023
Although this may be attributed to the timing of data collection, companies posting in 2021 are more traditional “big-tech” firms, while postings from 2023 are more scattered. Apple held over 600 postings, followed by Microsoft, Uber, and Salesforce, each with ~100 postings. According to Layoffs.fyi, 1,036 tech companies laid off a total of 238,397 employees in the first nine months of 2023. This aligns with the data, which shows a lack of postings from popular SaaS companies and tech giants. In fact, the largest number of postings in 2023 comes from Jobs for Humanity, a platform for “Connecting historically under represented talent to welcoming employers across the globe”.
Figure 8: Comparison of company counts in 2021 vs 2023.
Figure 9 shows the geographic locations of job opportunities. Each bubble indicates the number of postings in the given state, relative to other bubbles on the map. There isn’t a significant difference between the two years, with both seeing the largest number of postings in California, Texas and New York, the “tech hubs” of the U.S.
Figure 9: Maps of the United States showing relative posting counts by state.
From an initial breakdown of the datasets and variables, it was discovered that 2021 and 2023 saw slight differences in the metadata associated with postings. Location counts were the only consistent factor between the two years, while the Title, Seniority and Company data all support the argument that finding a job was more difficult for job seekers in 2023. Many popular destinations didn’t post opportunities for new grads or entry-level candidates and instead sought to fill more senior or leadership positions.
The next step of the project will tackle NLP and prediction models. Job skillsets, years of experience, and sentiment can be extracted and compared across the two years. An additional dataset containing salaries of SWE jobs in 2023 will be introduced to compare wages. A numeric feature allows for further exploration of variable relationships such as Title~Salary, Location~Salary, and Company~Salary. MLR, GLMM, and boosting models will then be trained on the new data to answer question 2 of our hypothesis, salary prediction.